feat: add ScalaUDF support via a codegen dispatcher #4267
Draft
mbutrovich wants to merge 52 commits into
There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward.
andygrove
reviewed
May 12, 2026
# Conflicts:
#   common/src/main/java/org/apache/comet/udf/CometUdfBridge.java
#   common/src/main/scala/org/apache/comet/udf/CometUDF.scala
#   docs/source/contributor-guide/index.md
#   native/core/src/execution/jni_api.rs
#   native/core/src/execution/planner.rs
#   native/spark-expr/src/jvm_udf/mod.rs
#   spark/src/main/scala/org/apache/comet/CometExecIterator.scala
#   spark/src/main/scala/org/apache/comet/serde/strings.scala
Which issue does this PR close?
Closes #.
Rationale for this change
#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: a `CometUDF` (`CometScalaUDFCodegen`) that compiles a specialized batch kernel per bound `ScalaUDF` expression and input schema via Janino. Without this path, any plan containing a `ScalaUDF` falls back to Spark for the enclosing operator, losing native execution on the surrounding plan.

The dispatcher is one of potentially many `CometUDF` implementations the bridge can route to. Hand-written `CometUDF`s for specific expression families (e.g. regex in #4239, JSON in #4305) remain a parallel path; the bridge dispatches by class name from the proto and does not require everything to go through the dispatcher.

Benefits:
- Any `ScalaUDF` whose argument and return types are in the supported surface routes through native without a hand-written `CometUDF`.
- Codegen covers the full `ScalaUDF` argument tree, so Catalyst sub-expressions inside the UDF (`upper(s)`, `concat(c1, c2)`, `monotonically_increasing_id()`) compile into the same per-row loop as the user function.

Opt-in via `spark.comet.exec.scalaUDF.codegen.enabled` (default `true`). When disabled, plans containing a `ScalaUDF` fall back to Spark for that operator.

The `CometUDF` contract loosens from "should be stateless" to "may hold per-task state in fields." One instance per Spark task attempt per class, reused across all batches of the task, dropped on task completion. Per-instance access is single-threaded because Spark runs one native future per partition and Tokio polls one future per worker at a time.

What changes are included in this PR?
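As a plain-JDK sketch of the per-task state model described above (class, method, and field names here are hypothetical; the real bridge wires eviction to Spark's `TaskCompletionListener` rather than calling it directly):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of the (taskAttemptId, className) -> instance cache.
// Names are illustrative, not the PR's actual classes.
class UdfInstanceCache {
    private static final ConcurrentHashMap<Long, ConcurrentHashMap<String, Object>> CACHE =
        new ConcurrentHashMap<>();

    // One instance per (task attempt, class), reused across all batches of
    // the task, even if Tokio work-stealing moves the task between workers.
    static Object getOrCreate(long taskAttemptId, String className, Supplier<Object> factory) {
        return CACHE
            .computeIfAbsent(taskAttemptId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(className, cn -> factory.get());
    }

    // In Spark this would run from a TaskCompletionListener: drop the whole
    // per-task map so per-task state does not outlive the task attempt.
    static void onTaskCompletion(long taskAttemptId) {
        CACHE.remove(taskAttemptId);
    }
}
```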
- `org.apache.comet.codegen`: `CometBatchKernelCodegen` (orchestrator) + `CometBatchKernelCodegenInput`/`CometBatchKernelCodegenOutput` (per-side emission) + `CometBatchKernelJava` base + `CometInternalRow`/`CometArrayData`/`CometMapData` shim bases + `CometSpecializedGettersDispatch` for shared `get(ordinal, dataType)` dispatch. The framework is generic over Catalyst expressions; today's only consumer is the ScalaUDF dispatcher.
- `org.apache.comet.udf.codegen`: `CometScalaUDFCodegen` (bridge entry, compile cache, per-partition kernel state).
- Support for `ArrayType`, `StructType`, and `MapType` as both input and output, including arbitrary nesting. Sealed `ArrowColumnSpec` plus recursive nested-class emission.
- Specialization per `(expression, input schema)`: zero-copy UTF8 reads on `VarCharVector`, non-nullable `isNullAt` elision, decimal short-precision fast path on both sides, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, `NullIntolerant` short-circuit, non-nullable output short-circuit, nullable-element elision on array/map writes, and subexpression elimination. Complex-type output writes hoist `getChildByOrdinal` and casts into once-per-batch setup so the per-row body has no runtime type dispatch and no redundant casts. In-code TODOs flag three further optimizations the input side has and the output side does not yet (UTF8 inline-unsafe write, cached write-buffer addresses, nested var-width sizing).
- Per-task instance cache: `ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>>` keyed by `(taskAttemptId, className)` with a `TaskCompletionListener` evicting the per-task entry. Invariant to Tokio work-stealing across batches: a task that migrates between workers still sees the same instance. Assertions on every invariant (single listener registration, non-null cache, reflective-instantiate success, `TaskContext` install effect).
- Serde: `CometScalaUDF` routes any `ScalaUDF` whose tree passes `CometBatchKernelCodegen.canHandle`. Proto build is inlined; no other expressions adopt the dispatcher in this PR.
- Output vector allocation via `Utils.toArrowField` and `Field.createVector` for every output type. The input spec derives Spark `DataType`s via `Utils.fromArrowField`. Exception paths close partially allocated vectors to avoid leaks. The Arrow `Field` is computed once per `(expression, schema)` cache entry rather than per batch.
- Docs: `docs/source/user-guide/latest/jvm_udf_dispatch.md` covers the on/off config, supported and unsupported types, behavior notes, and the cross-query recompile caveat. Architecture lives in Scaladoc on `CometScalaUDFCodegen` and `CometBatchKernelCodegen`; in-code TODOs carry the open items.

How are these changes tested?
- `CometCodegenSourceSuite`: generated-source assertions for every optimization and every complex-type shape.
- `CometCodegenDispatchSmokeSuite`: end-to-end correctness across the scalar and complex type surface (primitives, binary input and output, decimal precision boundaries, date/timestamp/timestampNTZ, array/struct/map round-trips including nested shapes and primitive-keyed maps), composed UDF trees, subquery reuse, `TaskContext` propagation, per-task cache isolation across sequential runs, kernel-cache reuse across batches of one query, and a ScalaUDF as a child of a native Spark expression.
- `CometCodegenDispatchFuzzSuite`: randomized decimal identity fuzz at several null densities.
- `CometScalaUDFCompositionBenchmark`: Spark vs. Comet with the dispatcher enabled vs. disabled, over three composed-UDF shapes.
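For readers unfamiliar with the approach, a rough illustration of the kind of kernel the dispatcher emits (purely hypothetical: real kernels are Janino-compiled per expression and schema, extend the generated base classes, and read Arrow vectors; plain `int[]` stands in for a column here):

```java
// Illustrative shape of a specialized kernel for a NullIntolerant
// int -> int ScalaUDF over a single non-nullable input column.
class GeneratedKernelSketch {
    // Specialized per (expression, input schema): the per-row loop has no
    // runtime type dispatch, and isNullAt checks are elided entirely
    // because the input column is declared non-nullable.
    static int[] evalBatch(int[] input) {
        int[] out = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = input[i] + 1; // user function body, inlined into the loop
        }
        return out;
    }
}
```

The point of compiling one such class per `(expression, input schema)` pair is that all type and nullability decisions are made once at codegen time, leaving a tight per-row loop at execution time.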